perf: build partition filter with balanced tree to avoid RecursionError#3264
Conversation
geruh
left a comment
There was a problem hiding this comment.
thanks for raising @yingjianwu98! I was able to test this locally and run into some recursion errors. Just left a non blocking comments.
| match_partition_expression = And(match_partition_expression, predicate) | ||
| expr = Or(expr, match_partition_expression) | ||
| return expr | ||
| partition_fields = [schema.find_field(f.source_id).name for f in spec.fields] |
There was a problem hiding this comment.
nit: don't need to make this style diff so we don't lose blame history.
| def test_build_large_partition_predicate(table_v2: Table) -> None: | ||
| """A left-folded Or chain over 5000 records would be depth-5000 and crash bind() | ||
| (Python's default recursion limit is ~1000). The balanced tree has depth ~14.""" | ||
| from pyiceberg.expressions.visitors import bind |
There was a problem hiding this comment.
nit: can we import these at the class level?
|
|
||
|
|
||
| def test_build_large_partition_predicate(table_v2: Table) -> None: | ||
| """A left-folded Or chain over 5000 records would be depth-5000 and crash bind() |
There was a problem hiding this comment.
Nit: idk how I feel about describing the old behavior in the docstring. I think it would be better to describe the current situation, i.e.
"""Partition predicates over many records must bind without RecursionError."""But at that point the test name already says that and makes me wonder if the docstring is needed maybe we could get rid of it. Again non-blocking and thinkning out loud.
28b54d6 to
d21f99a
Compare
|
Thanks for the review @geruh and I have updated the comments accordingly! |
Rationale for this change
We recently encountered recursion depth errors in _build_partition_predicate when performing overwrites involving a large number of partitions.
One way to address this is to switch to a balanced-tree implementation, which is already used in several other parts of the PyIceberg codebase to avoid recursion issues. This change aligns _build_partition_predicate with those existing patterns. For reference, see for example:
#1830
Are these changes tested?
Yes, with existing tests and also added another test to ensure it works with large number of input partitions
Are there any user-facing changes?
No